This is an IPython Notebook to accompany the paper entitled 'The Architecture of Genome Wide Association Studies' by Melinda Mills and Charles Rahal. It allows us to share the data analysis done in constructing this data-driven review of all GWAS to date, and it can be used to dynamically replicate the analysis going forward. For installation guidelines, please see the readme.md which accompanies this repository. A preliminary point to note: if you wish to run this .ipynb file itself, you may need to override the default settings of some versions of Jupyter Notebook (4.2 to 5.1) by opening with:
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000
or by editing your jupyter_notebook_config.py file. Let's begin by loading in our favourite Python modules, some IPython magics, and a bunch of custom functions we've written specifically for this project:
import csv
import itertools
import os
import re
import sys
import warnings

import gender_guesser.detector as gender
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
from Bio import Entrez
from IPython.display import HTML, display
from Support.LoadData import (
EFO_parent_mapper,
load_gwas_cat,
load_pubmed_data,
make_timely,
make_clean_CoR,
download_cat
)
from Support.PubMed import (
build_collective,
build_author,
build_funder,
build_abstract,
build_citation
)
from Support.Analysis import (
simple_facts,
ancestry_parser,
make_meanfemale_andranks,
make_funders,
mapped_trait_summary
)
from Support.Figures import (
gwas_growth,
choropleth_map,
wordcloud_figure,
plot_heatmap,
plot_bubbles,
boxswarm_plot
)
from Support.Additional import clean_names
from Support.Ancestry import ancestry_cleaner
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'png'
%matplotlib inline
%load_ext autoreload
%autoreload 2
Then, let's dynamically grab the three main curated datasets that we will need for our endeavours ('All Associations', 'All Studies', 'All Ancestries') from the GWAS Catalogue on the EBI website, plus the EFO-to-parent trait mapping file from their FTP site:
output_path = os.path.abspath(os.path.join('__file__',
'../..',
'data',
'Catalogue',
'Raw'))
ebi_download = 'https://www.ebi.ac.uk/gwas/api/search/downloads/'
download_cat(output_path, ebi_download)
Let's link the PUBMEDID variable to the PubMed API and retrieve a series of datasets from it using the support functions written in PubMed.py. Note: 'collective' mostly corresponds to consortia and study groups.
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
id_list = set(Cat_Studies['PUBMEDID'].astype(str).tolist())
Entrez.email = 'charlierahal@gmail.com'
papers = Entrez.read(Entrez.efetch(db='pubmed',retmode='xml', id=','.join(id_list)))
build_collective(papers)
build_author(papers)
build_funder(papers)
build_abstract(papers)
build_citation(id_list,Entrez.email)
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
simple_facts(Cat_Studies, Cat_Ancestry,
Cat_Ancestry_groupedbyN, Cat_Full,
AuthorMaster)
Let's make a nice infographic and do some basic NLP on the abstracts from all PubMed IDs. This figure is the header of the readme.md:
abstract_count = os.path.abspath(
os.path.join('__file__', '../..', 'data', 'PUBMED',
'Pubmed_AbstractCount.csv'))
output_file = os.path.abspath(
os.path.join('__file__', '../../', 'figures', 'pdf',
'helix_wordcloud_1250_5000_black.pdf'))
wordcloud_figure(abstract_count, output_file)
The file Pubmed_AbstractCount (in the PUBMED subdirectory of data) details a breakdown of the abstract counts. Check the counts for 'European', 'Asian', etc.
We can also visualise the ever-increasing sample sizes and the popularity of GWAS. The top panel shows the increasing number of GWAS conducted over time. We also see increasingly large sample sizes: i.e. the fraction of 'Big N' studies increases. The bottom-left subplot shows that with this growth comes bigger N and a high correlation with the number of significant associations found. The bottom-right subplot shows the unique number of journals publishing GWAS per year and the unique number of diseases or traits studied. Both steadily increase as the appeal of GWAS broadens.
output_file = os.path.abspath(
os.path.join('__file__', '../../', 'figures', 'svg',
'Figure_1.svg'))
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
gwas_growth(output_file, Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN)
This section of analysis only uses data contained within the catalogue, but goes slightly deeper than Section 2 above and other related papers in the literature. We use the 'Broad Ancestral Category' field and aggregate from 135 combinations of 17 ancestries into 7 'broader ancestral categories' to calculate a measure comparable to Popejoy and Fullerton. To get a more detailed analysis of polyvocality, we load in some of our supporting functions to clean up the 'INITIAL SAMPLE SIZE' and 'REPLICATION SAMPLE SIZE' free-text fields in the catalogue and then calculate something analogous to Panofsky and Bliss. The supporting functions are based on regular expressions and data-wrangling techniques which exploit patterns in these free-text fields. By way of a very simple example: "19,546 British ancestry individuals from 6863 families." gets cleaned into two separate fields, "19,546" and "British", which can then be used for further downstream analysis. A slightly more complex example: "2,798 European ancestry individuals, 228 French Canadian founder population individuals" will correspond to two entries of 2798 and 228 in the new 'Cleaned N' type variable, corresponding to 'European' and 'French Canadian' in the 'Cleaned Ancestry' type variable respectively. Uncomment the appropriate lines of text to get the full lists.
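To illustrate the regex idea behind this cleaning, here is a minimal sketch only; the real logic lives in ancestry_cleaner and ancestry_parser, and the pattern below is a hypothetical simplification:

```python
import re

# Hypothetical simplification of the free-text parsing: capture each
# "<number> <Capitalised ancestry words> ... individuals" pair.
PAIR = re.compile(r'([\d,]+)\s+'                   # sample size, e.g. "19,546"
                  r'((?:[A-Z][\w.]*\s?)+?)'        # capitalised ancestry words
                  r'(?:ancestry\s+)?(?:founder population\s+)?individuals')

def split_sample_field(text):
    """Return (count, ancestry) tuples found in a free-text sample field."""
    return [(int(n.replace(',', '')), anc.strip())
            for n, anc in PAIR.findall(text)]
```

On the examples above, '19,546 British ancestry individuals from 6863 families.' yields a single (19546, 'British') pair, since '6863 families' is not followed by capitalised ancestry text, while the French Canadian example yields two pairs.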
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Cat_Studies['InitialClean'] = Cat_Studies.apply(
lambda row: ancestry_cleaner(row, 'INITIAL SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
os.path.join('__file__',
'../..',
'data',
'Catalogue',
'Synthetic',
'new_initial_sample.csv'))
ancestry_parser(output_path, 'InitialClean', Cat_Studies)
Cat_Studies['ReplicationClean'] = Cat_Studies.apply(
lambda row: ancestry_cleaner(row, 'REPLICATION SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
os.path.join('__file__',
'../..',
'data',
'Catalogue',
'Synthetic',
'new_replication_sample.csv'))
ancestry_parser(output_path, 'ReplicationClean', Cat_Studies)
clean_initial = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..', 'data',
                 'Catalogue', 'Synthetic',
                 'new_initial_sample.csv')),
    encoding='utf-8')
clean_initial_sum = pd.DataFrame(
    clean_initial.groupby(['Cleaned Ancestry']).sum())
clean_initial_sum.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Sum', }, inplace=True)
clean_initial_count = clean_initial.groupby(['Cleaned Ancestry']).count()
clean_initial_count.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Count', }, inplace=True)
clean_initial_merged = clean_initial_sum.merge(pd.DataFrame(
clean_initial_count['Initial Ancestry Count']), how='outer', left_index=True,
right_index=True)
clean_initial_merged = clean_initial_merged.sort_values(
by='Initial Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_initial_merged.iterrows():
holder = holder + \
str(index) + ' (' + str(row['Initial Ancestry Sum']) + \
',' + str(row['Initial Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_initial_merged)) +
      ' ancestries found in the \'INITIAL SAMPLE SIZE\' column.' +
      '\nThese are (number of people used in all studies, ' +
      'number of studies included):\n\n' + holder[:-2] + '.\n\n')
clean_replication = pd.read_csv(os.path.abspath(
os.path.join('__file__',
'../..',
'data',
'Catalogue',
'Synthetic',
'new_replication_sample.csv')),
encoding='utf-8')
clean_replication_sum = pd.DataFrame(
clean_replication.groupby(['Cleaned Ancestry']).sum())
clean_replication_sum.rename(
columns={'Cleaned Ancestry Size': 'Replication Ancestry Sum', }, inplace=True)
clean_replication_count = clean_replication.groupby(
['Cleaned Ancestry']).count()
clean_replication_count.rename(
columns={'Cleaned Ancestry Size': 'Replication Ancestry Count', }, inplace=True)
clean_replication_merged = clean_replication_sum.merge(
pd.DataFrame(clean_replication_count['Replication Ancestry Count']),
how='outer', left_index=True, right_index=True)
clean_replication_merged = clean_replication_merged.sort_values(
by='Replication Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_replication_merged.iterrows():
holder = holder + \
str(index) + ' (' + str(row['Replication Ancestry Sum']) + \
',' + str(row['Replication Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_replication_merged)) +
      ' ancestries found in the \'REPLICATION SAMPLE SIZE\' column.' +
      ' These are (number of people used in all studies, number of' +
      ' studies included):\n\n' + holder[:-2] + '.')
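As an aside, the sum-count-rename-merge pattern used above can be collapsed into a single named aggregation; a sketch on toy data (not the project's actual files):

```python
import pandas as pd

# Toy stand-in for the cleaned sample-size files
toy = pd.DataFrame({'Cleaned Ancestry': ['European', 'European', 'Asian'],
                    'Cleaned Ancestry Size': [100, 50, 30]})
# One groupby-agg replaces the separate sum, count, rename and merge steps
summary = (toy.groupby('Cleaned Ancestry')['Cleaned Ancestry Size']
              .agg(**{'Ancestry Sum': 'sum', 'Ancestry Count': 'count'})
              .sort_values(by='Ancestry Sum', ascending=False))
```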
Let's aggregate to create a dataframe for native/indigenous peoples using a dictionary-based classifier:
native_dict = pd.read_csv(os.path.abspath(os.path.join('__file__', '../..',
'data', 'Support',
'native_classifier_dictionary.csv')),
index_col='Cleaned Ancestry')
native_df = pd.merge(clean_replication_merged, clean_initial_merged,
left_index=True,right_index=True, how='outer')
if len(np.setdiff1d(native_df.index.values, native_dict.index.values))>0:
print('There are still ancestries which need classifying as unique!'+
' Add them to the dictionary: ' +
', '.join(np.setdiff1d(native_df.index.values,native_dict.index.values)))
else:
print('Congratulations! The Native/Indigenous dictionary is fully up to date!')
native_df = native_df.fillna(0)
native_df['Total Ancestry Sum'] = native_df['Replication Ancestry Sum'] + \
native_df['Initial Ancestry Sum']
native_df['Total Ancestry Count'] = native_df['Replication Ancestry Count'] + \
native_df['Initial Ancestry Count']
native_df = pd.merge(native_df,native_dict,left_index=True, right_index=True)
print('We observe ' + str(len(native_df)) + ' unique terms for ancestry/ethnicity in total.')
print('Of these, ' + str(len(native_df[native_df['Native Population']!='Not Native'])) +
' relate to native or Indigenous populations ancestry/ethnicity.')
print('These relate to ' +
str(len(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique())) +
' specific native/indigenous groups as implied by the free text.')
print('These are: '+
', '.join(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
      ' mentions of ancestry/ethnicity, ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum())) +
      ' relate to native/indigenous groups as implied by the free text (' +
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum() /
                native_df['Total Ancestry Count'].sum()*100, 3)) + '%).')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
      ' participants, ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum())) +
      ' are of native/indigenous groups as implied by the free text (' +
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum() /
                native_df['Total Ancestry Sum'].sum()*100, 3)) + '%).')
print('The most common native/indigenous group implied by the free text is: ' +
      native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(
          ascending=False).index.values[0] +
      ', which contributed ' +
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(
          ascending=False).iloc[0])) +
      ' people in terms of sample size.')
print('These relate to ' +
      str(len(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique())) +
      ' specific native/indigenous groups implied by the free text or which are manually classified.')
print('These are: '+
', '.join(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
      ' mentions of ancestry/ethnicity, ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum())) +
      ' relate to native/indigenous populations implied by the free text or which are manually classified (' +
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum() /
                native_df['Total Ancestry Count'].sum()*100, 3)) + '%).')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
      ' participants, ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum())) +
      ' are of native/indigenous populations implied by the free text or which are manually classified (' +
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum() /
                native_df['Total Ancestry Sum'].sum()*100, 3)) + '%).')
print('The most common native/indigenous group implied by the free text or manually classified is: ' +
      native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(
          ascending=False).index.values[0] +
      ', which contributed ' +
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(
          ascending=False).iloc[0])) +
      ' people in terms of sample size.')
Let's now do a quick check that our 'Broader' dictionary is up to date:
if len(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]) > 0:
print('Perhaps we need to update the dictionary for new terms?')
print('Something like these ones: ' +
','.join(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]['BROAD ANCESTRAL'].unique()))
else:
print('Nice! Looks like our Broad to Broader dictionary is up to date!')
Let's put these numbers into tables to get a deeper understanding: first including rows which have at least one ancestry recorded as NR, and then dropping all rows in which at least one recorded ancestry is NR:
Broad_Ancestral_Full_initial = pd.DataFrame(
Cat_Ancestry[Cat_Ancestry.STAGE ==
'initial'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_initial.rename(
columns={'N': 'N (Initial)'}, inplace=True)
Broad_Ancestral_Full_replication = pd.DataFrame(
Cat_Ancestry[Cat_Ancestry.STAGE ==
'replication'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_replication.rename(
columns={'N': 'N (Replication)'}, inplace=True)
Broad_Ancestral_Full = pd.merge(Broad_Ancestral_Full_initial,
Broad_Ancestral_Full_replication,
left_index=True, right_index=True)
Broad_Ancestral_Full['Total'] = Broad_Ancestral_Full['N (Initial)'] + \
Broad_Ancestral_Full['N (Replication)']
Broad_Ancestral_Full['% Discovery'] = (
Broad_Ancestral_Full['N (Initial)'] /
Broad_Ancestral_Full['N (Initial)'].sum()) * 100
Broad_Ancestral_Full['% Replication'] = (
Broad_Ancestral_Full['N (Replication)'] /
Broad_Ancestral_Full['N (Replication)'].sum()) * 100
Broad_Ancestral_Full['% Total'] = (Broad_Ancestral_Full['Total'] /
Broad_Ancestral_Full['Total'].sum()) * 100
Broad_Ancestral_Full.to_csv(os.path.abspath(
os.path.join('__file__',
'../..',
'tables',
'Broad_Ancestral_Full.csv')))
Broad_Ancestral_Full
Drop the 'in part not recorded' rows (i.e. where the Broad Ancestral Category contains at least one NR):
Broad_Ancestral_NoNR = Broad_Ancestral_Full[['N (Initial)',
'N (Replication)',
'Total']]
Broad_Ancestral_NoNR = Broad_Ancestral_NoNR.drop('In Part Not Recorded')
Broad_Ancestral_NoNR['Total'] = Broad_Ancestral_NoNR['N (Initial)'] + \
Broad_Ancestral_NoNR['N (Replication)']
Broad_Ancestral_NoNR['% Discovery'] = (
Broad_Ancestral_NoNR['N (Initial)'] /
Broad_Ancestral_NoNR['N (Initial)'].sum()) * 100
Broad_Ancestral_NoNR['% Replication'] = (
Broad_Ancestral_NoNR['N (Replication)'] /
Broad_Ancestral_NoNR['N (Replication)'].sum()) * 100
Broad_Ancestral_NoNR['% Total'] = (Broad_Ancestral_NoNR['Total'] /
Broad_Ancestral_NoNR['Total'].sum()) * 100
Broad_Ancestral_NoNR.to_csv(os.path.abspath(
os.path.join('__file__', '../..', 'tables',
'Broad_Ancestral_NoNR.csv')))
Broad_Ancestral_NoNR
What does the raw 'BROAD ANCESTRAL' field look like before we visualise the 'Broader' aggregation? Here we drop all rows with multiple (comma-separated) entries:
GroupAnc = Cat_Ancestry[(Cat_Ancestry['BROAD ANCESTRAL'] != 'Other') &
(Cat_Ancestry['BROAD ANCESTRAL'].str.count(',') == 0)].\
groupby(['BROAD ANCESTRAL'])['N'].sum().to_frame()
GroupAnc['Individuals (%)'] = (GroupAnc['N'] /
GroupAnc['N'].sum()) * 100
GroupAnc.sort_values(by='Individuals (%)',ascending=False)[0:10]
Now let's make some bubble plots to visualise the occurrences of the 'Broader' ancestry conversion and the fields extracted from the free-text strings:
ancestry_colours = {'African': '#4daf4a', 'African Am./Caribbean': '#984ea3',
                    'Asian': '#e41a1c', 'European': '#377eb8',
                    'Hispanic/Latin American': '#ff7f00', 'Other/Mixed': '#ffff33'}
output_path = os.path.abspath(os.path.join('__file__', '../..', 'figures', 'svg',
                                           'Figure_2.svg'))
plot_bubbles(output_path, Cat_Ancestry, Broad_Ancestral_NoNR, ancestry_colours)
We can also analyse how these aggregates change over time; this was a key feature of Popejoy and Fullerton (2016). We can provide a much more granular account (simply remove '_NoNR' from the cell below to undertake an equivalent analysis including rows which contain some in-part-NR ancestries):
index = list(range(2007, 2018))
columns = ['European', 'Asian', 'African', 'Hispanic/Latin American',
'Other/Mixed', 'African Am./Caribbean']
Cat_Ancestry_NoNR = Cat_Ancestry[
Cat_Ancestry['Broader'] != 'In Part Not Recorded']
Broad_Ancestral_Time_NoNR_PC = pd.DataFrame(index=index, columns=columns)
for year in range(2007, 2018):
    for broaderancestry in Broad_Ancestral_NoNR.index.tolist():
        Broad_Ancestral_Time_NoNR_PC.loc[year, broaderancestry.strip()] = \
            (Cat_Ancestry_NoNR[
                (Cat_Ancestry_NoNR['DATE'].str.contains(str(year))) &
                (Cat_Ancestry_NoNR['Broader'] ==
                 broaderancestry)]['N'].sum() /
             Cat_Ancestry_NoNR[
                 Cat_Ancestry_NoNR['DATE'].str.contains(
                     str(year))]['N'].sum()) * 100
Broad_Ancestral_Time_NoNR_PC.to_csv(os.path.abspath(
os.path.join('__file__', '../..', 'tables',
'Broad_Ancestral_Time_NoNR_PC.csv')))
Broad_Ancestral_Time_NoNR_PC.head(12)
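The nested loop above is, in effect, a row-normalised crosstab; a sketch on toy data (assuming a numeric year column, rather than the catalogue's DATE strings):

```python
import pandas as pd

toy = pd.DataFrame({'YEAR': [2007, 2007, 2008],
                    'Broader': ['European', 'Asian', 'European'],
                    'N': [80, 20, 50]})
# Percentage of participants per broader ancestry within each year:
# normalize='index' divides each row by its own total
pct = pd.crosstab(index=toy['YEAR'], columns=toy['Broader'],
                  values=toy['N'], aggfunc='sum', normalize='index') * 100
```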
We can focus on the initial discovery stage to calculate the number of individuals required per ancestry to uncover one hit. Note, however, that this does require some distributional assumptions (i.e. that the same diseases are being studied across ancestries).
Cat_Ancestry_Initial = Cat_Ancestry[Cat_Ancestry['STAGE'] == 'initial']
Cat_Ancestry_NoDups = Cat_Ancestry_Initial.drop_duplicates(
subset=['STUDY ACCESSION'],
keep=False)[['PUBMEDID', 'STUDY ACCESSION',
'BROAD ANCESTRAL',
'N']]
Cat_Ancestry_NoDups_merge = pd.merge(Cat_Ancestry_NoDups,
Cat_Studies[['ASSOCIATION COUNT',
'STUDY ACCESSION',
'MAPPED_TRAIT']],
how='left', on='STUDY ACCESSION')
listtoiterate = ['European', 'East Asian', 'South Asian',
                 'African Am./Caribbean', 'Hispanic/Latin American']
for ancestry in listtoiterate:
temp = Cat_Ancestry_NoDups_merge[Cat_Ancestry_NoDups_merge[
'BROAD ANCESTRAL'].str.strip() == ancestry]
print('The number of ' + ancestry + 's required to find one hit: ' +
str(round(1 /
(temp['ASSOCIATION COUNT'].sum() / temp['N'].sum()), 1)))
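As a sanity check on the arithmetic: 1 / (associations / N) is simply participants per association. With made-up numbers:

```python
# Hypothetical totals, for illustration only
n_participants = 1_000_000
n_associations = 2_000
# 1 / (hits per person) == people per hit == n_participants / n_associations
per_hit = round(1 / (n_associations / n_participants), 1)
```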
This is a choropleth map of country of recruitment; note that this field is not quite fully populated in the Catalogue. Basemap code is loosely based on this. As we want to study the number of people recruited from each country, we can only utilise values in the 'Country of Recruitment' field when exactly one country is mentioned. For example: if N=100,000 and Country of Recruitment = {U.K., U.S.}, there is no way for us to know the breakdown between the two countries from the Catalogue (especially as the free-text field may just record 'European' ancestry). Our only option in this scenario is to drop such observations. This forces us to drop about 22% of all studies, which corresponds to about 48% of the Catalogue as measured by N (because bigger studies have multiple countries of recruitment). We also face multiple entries equal to 'NR', which corresponds to 'not recorded'. This reduces the number of studies going into the map by a further 21% (of the initial total), meaning the map only includes about half of all GWAS. This has no relationship to whether ancestry is coded.
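make_clean_CoR implements the real cleaning; the dropping rule itself can be sketched on toy data (the column name below is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'COUNTRY OF RECRUITMENT': ['U.K.', 'U.K., U.S.', 'NR'],
                    'N': [100, 100000, 500]})
# Keep only rows with exactly one recorded country: multi-country rows give
# no per-country breakdown of N, and 'NR' rows record no country at all.
single = toy[(toy['COUNTRY OF RECRUITMENT'].str.count(',') == 0) &
             (toy['COUNTRY OF RECRUITMENT'] != 'NR')]
```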
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Clean_CoR = make_clean_CoR(Cat_Ancestry)
Cleaned_CoR_N = pd.DataFrame(Clean_CoR.groupby(['Cleaned Country'])['N'].sum())
Cleaned_CoR_N['Rank_N'] = Cleaned_CoR_N.rank(
ascending=False, method='dense').astype(int)
Cleaned_CoR_count = pd.DataFrame(
Clean_CoR.groupby(['Cleaned Country'])['N'].count())
Cleaned_CoR_count['Rank_count'] = Cleaned_CoR_count.rank(
ascending=False, method='dense').astype(int)
Cleaned_CoR_count = Cleaned_CoR_count.rename(columns={'N': 'Count'})
Cleaned_CoR = pd.merge(Cleaned_CoR_count, Cleaned_CoR_N,
left_index=True, right_index=True)
Cleaned_CoR.index.name = None
Cleaned_CoR['Count (%)'] = round((Cleaned_CoR['Count'] /
Cleaned_CoR['Count'].sum()) * 100,2)
Cleaned_CoR['N (%)'] = round((Cleaned_CoR['N'] /
Cleaned_CoR['N'].sum()) * 100,2)
countrylookup = pd.read_csv(os.path.abspath(
os.path.join('__file__',
'../..',
'data',
'ShapeFiles',
'Country_Lookup.csv')),
index_col='Country')
country_merged = pd.merge(countrylookup,
Cleaned_CoR,
left_index=True,
right_index=True)
country_merged['Per Rec'] = round(country_merged['N']/pd.to_numeric(country_merged['2017population']),2)
country_merged.sort_values('Rank_N', ascending=True)[
['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].to_csv(
os.path.abspath(os.path.join('__file__',
'../..',
'tables',
'CountryOfRecruitment.csv')))
country_merged.sort_values('Rank_N', ascending=True)[
['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].head(10)
output_path = os.path.abspath(os.path.join('__file__',
'../..',
'figures',
'svg',
'Figure_3.svg'))
choropleth_map(country_merged, 'N', 'OrRd', output_path)
output_path = os.path.abspath(os.path.join('__file__',
'../..',
'figures',
'svg',
'Figure_3_Blues.svg'))
choropleth_map(country_merged, 'Per Rec', 'Blues', output_path)
Let's now merge that with a country-to-continent lookup, using data from within the shapefile itself (based on the CIA World Factbook):
country_merged_sumcount = country_merged[[
'Continent', 'Count']].groupby(['Continent']).sum()
country_merged_sumN = country_merged[[
'Continent', 'N']].groupby(['Continent']).sum()
country_merged_sums = pd.merge(
country_merged_sumN, country_merged_sumcount,
left_index=True, right_index=True)
country_merged_sums['N (%)'] = (
country_merged_sums['N'] / country_merged_sums['N'].sum()) * 100
country_merged_sums['Count (%)'] = (
country_merged_sums['Count'] / country_merged_sums['Count'].sum()) * 100
country_merged_sums.to_csv(os.path.abspath(
os.path.join('__file__',
'../..',
'tables',
'ContinentOfRecruitment.csv')))
country_merged_sums
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
AllFunders = FunderInfo.groupby(by='Agency').count()
AllFunders.index.name = None
AllFunders = AllFunders.reset_index()
AllFunders = AllFunders.rename(columns={
'index': 'Agency',
'PUBMEDID': 'Grant Contributions',
'GrantCountry': 'Country'})
AllFunders_withcountries = pd.merge(AllFunders[['Agency',
'Grant Contributions']],
FunderInfo[['Agency',
'GrantCountry']].drop_duplicates(
'Agency'),
on='Agency', how='left')
AllFunders_withcountries = AllFunders_withcountries.set_index('Agency')
AllFunders_withcountries.index.name = None
AllFunders_withcountries['% of Total'] = round((
AllFunders_withcountries['Grant Contributions'] /
AllFunders_withcountries['Grant Contributions'].sum()) * 100, 2)
AllFunders_withcountries.sort_values(
'Grant Contributions', ascending=False)[0:20]
Let's print out some simple descriptives from these data:
print('There are ' + str(len(FunderInfo['Agency'].drop_duplicates())) +
' unique funders returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantID'].drop_duplicates())) +
' unique grants returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantCountry'].drop_duplicates())) +
' unique grant countries returned from PubMed Central.')
print('Each study has an average of ' +
str(round(len(FunderInfo) / len(id_list), 2)) + ' grants funding it.')
grantgroupby = FunderInfo.groupby(['Agency', 'GrantID']).size().groupby(
level=1).max().sort_values(ascending=False).reset_index()
print('The most frequently acknowledged grant is GrantID ' +
grantgroupby['GrantID'][0] + '.\nThis grant is acknowledged ' +
str(grantgroupby[0][0]) + ' times.')
Which countries do these grants come from?
TopCountryFunders = FunderInfo.groupby(by='GrantCountry').count()
TopCountryFunders = TopCountryFunders.rename(
columns={'PUBMEDID': 'Number Of Studies'})
TopCountryFunders = TopCountryFunders.sort_values(
'Number Of Studies', ascending=False)[['Number Of Studies']]
TopCountryFunders = TopCountryFunders.reset_index().rename(
columns={'GrantCountry': 'Country of Funder'})
TopCountryFunders['Percent Funded (%)'] = (
TopCountryFunders['Number Of Studies'] /
TopCountryFunders['Number Of Studies'].sum()) * 100
TopCountryFunders = TopCountryFunders.set_index('Country of Funder')
TopCountryFunders.index.name = None
TopCountryFunders.head(10)
Let's first get some metrics from splitting on the commas in the EFO field. We can make identical heatmaps by simply replacing EFO_Parent_Mapped with EFO_Parent_Mapped_NoCommas. This inherently assumes that a GWAS which mentions two EFO terms contributes equally to each of them: as much as a GWAS with one EFO term (i.e. no comma to split on) contributes to that one specific EFO.
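The comma-splitting step itself (EFO_parent_mapper does the real mapping; the toy column below is purely illustrative) amounts to:

```python
import pandas as pd

toy = pd.DataFrame({'PUBMEDID': [1, 2],
                    'MAPPED_TRAIT': ['body mass index', 'asthma, eczema']})
# One row per EFO term: a two-term study now contributes one row to each
# term, exactly as much as a single-term study contributes to its one term
split = (toy.assign(**{'EFO term': toy['MAPPED_TRAIT'].str.split(', ')})
            .explode('EFO term'))
```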
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
print('There are ' + str(len(EFO_Parent_Mapped)) +
' rows of EFO terms after we split for commas.')
print('This indicates ' + str(len(EFO_Parent_Mapped) -
                              len(EFO_Parent_Mapped_NoCommas)) +
      ' additional terms were mapped compared to when we drop' +
      ' comma-separated entries.')
print('This indicates ' +
str(len(EFO_Parent_Mapped['EFO term'].drop_duplicates())) +
' unique EFO terms to map to Parents.')
print('This is in comparison to ' +
str(len(EFO_Parent_Mapped_NoCommas['EFO term'].drop_duplicates())) +
' unique EFO terms in Cat_Studies.')
mapped_trait_summary(EFO_Parent_Mapped, 'EFO term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas,'EFO term')
mapped_trait_summary(EFO_Parent_Mapped, 'Parent term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas, 'Parent term')
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
EFO_Parent_Mapped['Parent term'] = \
EFO_Parent_Mapped['Parent term'].str.replace('Disorder', '')
FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped,
left_on='PUBMEDID', right_on='PUBMEDID',
how='left')
funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
output_path = os.path.abspath(os.path.join('__file__',
'../..',
'figures',
'svg',
'Figure_4.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)
Let's do the same thing, but this time dropping, rather than splitting on, the EFO commas:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
EFO_Parent_Mapped_NoCommas['Parent term'] = \
EFO_Parent_Mapped_NoCommas['Parent term'].str.replace('Disorder', '')
FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped_NoCommas,
left_on='PUBMEDID', right_on='PUBMEDID',
how='left')
funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
output_path = os.path.abspath(os.path.join('__file__', '../..', 'figures',
'svg', 'Figure_4_NoCommas.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)
Who are the most cited authors? What is their 'GWAS H-Index', calculated based only on their papers in the GWAS Catalogue? This assumes unique forename + surname combinations, and that the same author is recorded consistently across studies. First, a couple of snippets:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are a total of ' + str(len(AuthorMaster)) +
' "authorship contributions".')
print('These contributions are made by ' +
str(len(AuthorMaster['AUTHORNAME'].unique())) + ' unique authors.')
print('There are a total of ' +
      str(round(len(AuthorMaster) /
                len(AuthorMaster['PUBMEDID'].drop_duplicates()), 2)) +
      ' "authorship contributions" per paper.')
print('The study with the most number of authors has ' +
str(AuthorMaster.groupby(['PUBMEDID']).size().max()) +
' authors (PubMed ID: ' +
str(AuthorMaster.groupby(['PUBMEDID']).size().idxmax()) + ')')
Let's then calculate our GWAS H-Indexes using citation data and paper counts for unique authors.
CitationCounts = pd.read_csv(os.path.abspath(
os.path.join('__file__', "../..",
"data", "PUBMED", 'Pubmed_Cites.csv')))
AuthorMaster_withcites = pd.merge(
AuthorMaster, CitationCounts, on=['PUBMEDID'], how='left')
allauthors_formerge = pd.DataFrame(
AuthorMaster_withcites[['AUTHORNAME', 'citedByCount']])
allauthors_papercount = pd.DataFrame(
allauthors_formerge['AUTHORNAME'].value_counts())
allauthors_citecount = pd.DataFrame(
allauthors_formerge.groupby(by='AUTHORNAME')['citedByCount'].sum())
allauthors_merged = pd.merge(
allauthors_papercount, allauthors_citecount, left_index=True,
right_index=True)
allauthors_merged.columns = ['Papers', 'citedByCount']
allauthors_merged = allauthors_merged.sort_values(
by='citedByCount', ascending=False)
allauthors_merged['GWAS-H'] = np.NaN
counter = 0
for author in allauthors_merged.index:
counter += 1
temp = AuthorMaster_withcites[AuthorMaster_withcites['AUTHORNAME'] ==
author].sort_values(by='citedByCount',
ascending=False).dropna()
temp = temp.reset_index(drop=True)
for pubnumber in range(0, len(temp)):
if pubnumber + 1 >= temp['citedByCount'][pubnumber]:
allauthors_merged.loc[author, ('GWAS-H')] = int(pubnumber)
break
sys.stdout.write('\r' + 'Calculating GWAS H-indices: finished ' +
str(counter) + ' of ' +
str(len(allauthors_merged.index)) + ' authors...')
allauthors_citecount.reset_index(inplace=True)
allauthors_papercount.reset_index(inplace=True)
allauthors_papercount.rename(
columns={'AUTHORNAME': 'PAPERCOUNT', 'index': 'AUTHORNAME', },
inplace=True)
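For reference, a minimal standalone implementation of the standard h-index definition (the largest h such that h papers have at least h citations each); note that its handling of exact-tie boundaries may differ slightly from the loop above:

```python
def h_index(citations):
    """Largest h such that h of the papers have >= h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h
```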
Let's next calculate some measures of authorship centrality with NetworkX. Note: this can take some time on slow computers. To make it run faster, change the citedByCount filter to something like 100 to keep only authors with at least a hundred citations (or raise the PAPERCOUNT > 1 filter).
AuthorMaster_sumcites = pd.merge(AuthorMaster, allauthors_citecount,
left_on='AUTHORNAME', right_on='AUTHORNAME',
how='left')
AuthorMaster_sumcitespapercount = pd.merge(AuthorMaster_sumcites,
allauthors_papercount,
left_on='AUTHORNAME',
right_on='AUTHORNAME', how='left')
AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
AuthorMaster_sumcitespapercount['citedByCount'] > 10]
AuthorMaster_sumcitespapercount_filtered =\
AuthorMaster_sumcitespapercount_filter_cites[
AuthorMaster_sumcitespapercount_filter_cites['PAPERCOUNT'] > 1]
G = nx.Graph()
counter = 0
alledges = []
for paper in AuthorMaster_sumcitespapercount_filtered['PUBMEDID'].unique():
counter += 1
temp = AuthorMaster_sumcitespapercount_filtered[
AuthorMaster_sumcitespapercount_filtered['PUBMEDID'] == paper]
if len(temp) > 1:
templist = list(itertools.combinations(temp.AUTHORNAME, 2))
for edge in templist:
alledges.append(edge)
G.add_edges_from(list(set(alledges)))
print('This gives us a network with ' + str(len(G)) +
' nodes.\nThese are unique authors with >1 paper and >10 cites')
betcent = pd.Series(nx.betweenness_centrality(G), name='Betweenness')
allauthors_merged = allauthors_merged.merge(
betcent.to_frame(), left_index=True, right_index=True)
degcent = pd.Series(nx.degree_centrality(G), name='Degree')
allauthors_merged = allauthors_merged.merge(
degcent.to_frame(), left_index=True, right_index=True)
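The edge-building loop above can be sketched on toy data: one undirected edge per unordered author pair within each paper, de-duplicated before being handed to `G.add_edges_from`. The paper IDs and names below are made up. Sorting each pair first makes the `set` de-duplication order-insensitive (in the loop above, `(a, b)` and `(b, a)` survive `set` as distinct tuples, although `nx.Graph` still merges them):

```python
import itertools

# Hypothetical papers mapped to their author lists
papers = {
    111: ['Smith J', 'Lee K', 'Patel R'],
    222: ['Lee K', 'Smith J'],  # repeats the Lee-Smith pair
}

alledges = []
for authors in papers.values():
    if len(authors) > 1:
        # one edge per co-author pair on this paper
        alledges.extend(itertools.combinations(sorted(authors), 2))

unique_edges = set(alledges)
print(len(alledges), len(unique_edges))  # -> 4 raw pairs, 3 unique edges
```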
We can then merge all of this with data collected manually from web searches (country of employment, current employer, etc.), and rank authors in three different ways to analyze the overlap between the three metrics.
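The `CitePercentile` column below relies on pandas' `rank(pct=True)`, which by default assigns tied values their average rank and divides by the number of observations. A minimal re-implementation of that rule, on illustrative values:

```python
def pct_rank(values):
    """Percentile rank with average ranks for ties,
    matching pandas' rank(pct=True) default behaviour."""
    n = len(values)
    out = []
    for v in values:
        less = sum(x < v for x in values)
        equal = sum(x == v for x in values)
        out.append((less + (equal + 1) / 2) / n)  # average rank of the tie block
    return out

print(pct_rank([10, 20, 20, 40]))  # -> [0.25, 0.625, 0.625, 1.0]
```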
authorsupplemental = pd.read_csv(os.path.abspath(
os.path.join('__file__', "../..",
"Data", "Support", 'Author_Supplmentary.csv')),
encoding='latin-1', index_col=0)
allauthors_merged_withsupp = pd.merge(allauthors_merged,
authorsupplemental,
left_index=True, right_index=True,
how='left',)
allauthors_merged_withsupp['Betweenness'] = round(
    allauthors_merged_withsupp['Betweenness'], 3)
allauthors_merged_withsupp['Degree'] = round(
    allauthors_merged_withsupp['Degree'], 3)
allauthors_merged_withsupp['CitePercentile'] = round(
    allauthors_merged_withsupp.citedByCount.rank(pct=True), 4)
allauthors_merged_withsupp.sort_values(by='GWAS-H',
ascending=False).to_csv(
os.path.abspath(os.path.join('__file__',
"../..",
"tables", "Authors.csv")))
allauthors_merged_withsupp.sort_values(by='GWAS-H', ascending=False).head(20)
allauthors_merged_withsupp.sort_values(by='Degree',ascending=False).head(10)
allauthors_merged_withsupp.sort_values(by='citedByCount',
ascending=False).head(10)
And then make a simple correlation matrix to check how strongly related these metrics are:
allauthors_merged_withsupp[['citedByCount', 'GWAS-H',
'Betweenness', 'Degree', 'Papers']].corr()
print('There are a total of ' +
str(len(allauthors_merged_withsupp)) + ' authors in the table.')
print('The person with the highest GWAS H-Index is: ' +
      allauthors_merged_withsupp['GWAS-H'].idxmax())
print('The person with the highest Degree is: ' +
allauthors_merged_withsupp['Degree'].idxmax())
print('The person with the highest citedByCount is: ' +
allauthors_merged_withsupp['citedByCount'].idxmax())
print('The person with the largest number of Papers is: ' +
      allauthors_merged_withsupp['Papers'].idxmax())
Let's consider the gender of each author by analyzing their forenames using the gender-guesser library. We can directly compare our results to this wonderful paper. A key assumption made here is that all 'mostly_male' and 'mostly_female' names are combined into their respective male and female categories.
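The label set returned by `gender_guesser`'s `get_gender()` is 'male', 'female', 'mostly_male', 'mostly_female', 'andy' (androgynous) and 'unknown'. The collapsing assumption can be illustrated without the library itself; these example labels are hand-written, not real detector output:

```python
labels = ['female', 'mostly_female', 'male', 'mostly_male', 'andy', 'unknown']

# Collapse the 'mostly_' categories into their definite counterparts,
# mirroring AuthorMaster['CleanGender'].str.replace('mostly_', '')
collapsed = [lab.replace('mostly_', '') for lab in labels]
print(collapsed)  # -> ['female', 'female', 'male', 'male', 'andy', 'unknown']
```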
gendet = gender.Detector()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
AuthorCounts = AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count(
).to_frame().reset_index().rename(columns={"AUTHORNAME": "Author Count"})
AuthorMaster = pd.merge(AuthorMaster, AuthorCounts, how='left', on='PUBMEDID')
AuthorMaster['CleanForename'] = AuthorMaster['FORENAME'].map(
lambda x: clean_names(x))
AuthorMaster['CleanGender'] = AuthorMaster['CleanForename'].map(
lambda x: gendet.get_gender(x))
AuthorMaster['MaleFemale'] = AuthorMaster['CleanGender'].str.replace('mostly_',
'')
AuthorMaster['isfemale'] = np.where(
AuthorMaster['MaleFemale'] == 'female', 1, 0)
AuthorFirst = AuthorMaster[AuthorMaster['Author Count']
> 4].drop_duplicates(subset='PUBMEDID', keep='first')
AuthorLast = AuthorMaster[AuthorMaster['Author Count']
> 4].drop_duplicates(subset='PUBMEDID', keep='last')
AuthorUnique = AuthorMaster.drop_duplicates(subset='AUTHORNAME')
for gender_ in AuthorMaster['CleanGender'].unique():
    print(str(round((len(AuthorMaster[AuthorMaster['CleanGender'] == gender_]) /
                     len(AuthorMaster['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authorship contributions in total')
    print(str(round((len(AuthorUnique[AuthorUnique['CleanGender'] == gender_]) /
                     len(AuthorUnique['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' unique authors in total')
    print(str(round((len(AuthorFirst[AuthorFirst['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' first authors in total')
    print(str(round((len(AuthorLast[AuthorLast['CleanGender'] == gender_]) /
                     len(AuthorLast['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' last authors in total')
print('\nPercent of male author contributions: ' +
str(round(len(AuthorMaster[AuthorMaster['MaleFemale'] == 'male']) /
(len(AuthorMaster[(AuthorMaster['MaleFemale'] == 'male') |
(AuthorMaster['MaleFemale'] == 'female')])) *
100, 3)) + '%')
print('Percent of unique male authors: ' +
str(round(len(AuthorUnique[AuthorUnique['MaleFemale'] == 'male']) /
(len(AuthorUnique[(AuthorUnique['MaleFemale'] == 'male') |
(AuthorUnique['MaleFemale'] == 'female')])) *
100, 3)) + '%')
print('Percent of male first authors: ' +
str(round(len(AuthorFirst[AuthorFirst['MaleFemale'] == 'male']) /
(len(AuthorFirst[(AuthorFirst['MaleFemale'] == 'male') |
(AuthorFirst['MaleFemale'] == 'female')])) *
100, 3)) + '%')
print('Percent of male last authors: ' +
str(round(len(AuthorLast[AuthorLast['MaleFemale'] == 'male']) /
(len(AuthorLast[(AuthorLast['MaleFemale'] == 'male') |
(AuthorLast['MaleFemale'] == 'female')])) *
100, 3)) + '%')
AuthorMaster_filtered = AuthorMaster[(AuthorMaster['MaleFemale'].str.contains(
'male')) & (AuthorMaster['Author Count'] > 4)]
AuthorMaster_filtered_merged_bydisease = pd.merge(
AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'DISEASE/TRAIT']],
how='left', on='PUBMEDID')
meanfemales_bydisease = AuthorMaster_filtered_merged_bydisease.groupby(
['DISEASE/TRAIT'])['isfemale'].mean().to_frame()
numberofstudies_bydisease = Cat_Studies.groupby(
['DISEASE/TRAIT'])['DISEASE/TRAIT'].count().to_frame()
mergedgender_andcount_bydisease = pd.merge(
numberofstudies_bydisease, meanfemales_bydisease, left_index=True,
right_index=True)
mergedgender_andcount_bydisease = mergedgender_andcount_bydisease.sort_values(
by='DISEASE/TRAIT', ascending=False)[0:10]
holdstring = 'Percent female authorship across the 10 most commonly studied diseases:\n'
for index, row in mergedgender_andcount_bydisease.iterrows():
    holdstring = holdstring + \
        index.title() + ' (' + str(round(row['isfemale'] * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
Cat_Studies, Cat_Ancestry_groupedbyN)
ranks, countstudiesperEFO, meanfemale, AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster,
EFO_Parent_Mapped)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
'svg', 'Sup_Figure_1_Commas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)
AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(row['isfemale'] * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
        AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + \
        row['EFO term'].title() + ' (' + str(round(row['isfemale'] * 100, 1)) + \
        '%)\n'
print('\n' + holdstring[:-1])
Let's do exactly the same as above, but drop the fields containing commas:
ranks, countstudiesperEFO, meanfemale, AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster, EFO_Parent_Mapped_NoCommas)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
'svg', 'Sup_Figure_1_NoCommas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)
AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(row['isfemale'] * 100, 1)) + '%)\n'
print('\n' + holdstring[:-1])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
        AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + \
        row['EFO term'].title() + ' (' + str(round(row['isfemale'] * 100, 1)) + \
        '%)\n'
print('\n' + holdstring[:-1])
allauthors_merged_reset = allauthors_merged.reset_index().rename(columns={'index': 'Name'})
allauthors_merged_reset['CleanForename'] = allauthors_merged_reset['Name'].map(lambda x: x.split(' ')[0])
allauthors_merged_reset['CleanGender'] = allauthors_merged_reset['CleanForename'].map(
lambda x: gendet.get_gender(x))
allauthors_merged_reset['MaleFemale'] = allauthors_merged_reset['CleanGender'].str.replace('mostly_',
'')
allauthors_merged_reset['isfemale'] = np.where(
allauthors_merged_reset['MaleFemale'] == 'female', 1, 0)
print('The average GWAS-H Index for females is: ' +
str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']['GWAS-H'].mean(),3)))
print('The average GWAS-H Index for males is: ' +
str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']['GWAS-H'].mean(),3)))
print('The average number of papers published by females is: ' +
str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']['Papers'].mean(),3)))
print('The average number of papers published by males is: ' +
str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']['Papers'].mean(),3)))
print('The average number of citations for females is: ' +
str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']['citedByCount'].mean(),3)))
print('The average number of citations for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale'] == 'male']['citedByCount'].mean(), 3)))
Let's use the PubMed database returns on collectives to analyze which consortia and study groups are acknowledged:
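The verification pattern below (normalize the strings, left-merge against a hand-curated dictionary, flag the misses for manual review) can be sketched with plain dict lookups; the strings and dictionary here are illustrative only:

```python
# Hypothetical raw collective strings from PubMed and a cleaning dictionary
raw = [' GIANT Consortium ', 'CHARGE', 'Some New Group']
clean_dict = {'giant consortium': 'GIANT', 'charge': 'CHARGE'}

# Normalize exactly as below: strip whitespace, lower-case
normalised = [s.strip().lower() for s in raw]

# Anything missing from the dictionary needs manual verification
unverified = [s for s in normalised if s not in clean_dict]
print(unverified)  # -> ['some new group']
```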
pubmed_collectives = pd.read_csv(os.path.abspath(
os.path.join('__file__', '../..',
'data', 'PUBMED',
'Pubmed_CollectiveInfo.csv')),
encoding='latin-1')
collect_dict = pd.read_csv(os.path.abspath(
os.path.join(
'__file__', '../..', 'Data',
'Support', 'Collectives',
'Collective_Dictionary.csv')),
encoding='latin-1')
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.strip()
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.lower()
collect_dict['COLLECTIVE'] = collect_dict['COLLECTIVE'].str.strip(
).str.lower()
merged_collect = pd.merge(
pubmed_collectives, collect_dict, how='left', on='COLLECTIVE')
if len(merged_collect[merged_collect['Clean_Name'].isnull()]) > 0:
print('Danger! Correct collectives in Collective_Unverified.csv ' +
'and add to the dictionary\n')
unverified_collect = merged_collect[merged_collect['Clean_Name'].isnull(
)]
unverified_collect.to_csv(os.path.abspath(
os.path.join('__file__', '../..',
'Data', 'Support', 'Collectives',
'Collective_Unverified.csv')))
else:
print('The consortium dictionary is up to date!\n')
consortiumlist = []
for index, row in merged_collect.iterrows():
    if pd.notnull(row['Clean_Name']):
        for consortium in row['Clean_Name'].split(';'):
            consortiumlist.append(consortium)
clean_collectives = pd.DataFrame(consortiumlist, columns=['Consortium'])
groupby_collectives = clean_collectives.groupby(
['Consortium'])['Consortium'].count().sort_values(ascending=False)
holdstring = 'The most frequently seen consortia are:\n'
for index, row in groupby_collectives.to_frame()[0:10].iterrows():
    holdstring = holdstring + index + ' (' + str(row['Consortium']) + ')\n'
print(holdstring[:-1] + '.')
print('\nWe have seen a total of ' +
str(len(groupby_collectives.to_frame())) + ' different collectives!')
print('A total of ' + str(len(pubmed_collectives['PUBMEDID'].drop_duplicates(
))) + ' papers are contributed to by at least one collective.')
print('A total of ' + str(groupby_collectives.sum()) +
' collective contributions are made.')
Let's now wrangle together some of the manually collated data and consider the most frequently observed cohorts, with the aim of analyzing the resampling issue discussed earlier in the literature.
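The wrangling below splits each ';'-delimited `DATASETS` cell, strips whitespace and counts cohort occurrences. The same pattern on toy data (the cohort strings are illustrative):

```python
from collections import Counter

# Hypothetical DATASETS cells as they appear in the manual file
rows = ['UK Biobank; deCODE', 'deCODE; ARIC', 'UK Biobank']

cohorts = []
for cell in rows:
    cohorts += [name.strip() for name in cell.split(';')]

counts = Counter(cohorts)
print(counts['UK Biobank'], counts['deCODE'])  # -> 2 2
```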
finaldataset = pd.read_csv(os.path.abspath(
os.path.join('__file__', '../..', 'data',
'Support', 'Cohorts',
'GWASCat1to1000 final.csv')),
encoding='latin-1',
index_col='Number')
mylist = []
for index, row in finaldataset.iterrows():
mylist = mylist + row['DATASETS'].split(';')
mylist = [x.strip(' ') for x in mylist]
df1 = pd.DataFrame({'Cohorts': mylist})
with open(os.path.abspath(os.path.join(
        '__file__', '../..', 'data', 'Support', 'Cohorts',
        'Dictionary_cohorts.csv')), 'r') as dict_file:
    # each row is a (raw name, clean name) pair
    d = dict(csv.reader(dict_file))
for key in d:
    df1['Cohorts'].replace(key, d[key], inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.replace(
    'Rotterdam Study I (RS-I)', 'Rotterdam Study (RS)', regex=False)
newdf = pd.DataFrame({'Count': df1.groupby(['Cohorts']).size()}).reset_index()
Then merge this with manually collated data:
manual_cohort_for_merge = pd.read_csv(os.path.abspath(
os.path.join('__file__', "../..", "data",
"Support", "Cohorts",
"manual_cohort_for_merge.csv")),
encoding='latin-1', index_col=False)
merged_manual = pd.merge(newdf, manual_cohort_for_merge,
how='left', on='Cohorts')
merged_manual.set_index('Cohorts').sort_values(
by='Count', ascending=False).drop('NO NAME').head(10).style
print('We guess there are a total of ' + str(len(newdf)) +
' different cohorts used in the 1000 biggest studies.')
Let's now print out a list of the 30 most commonly used datasets in case we need to reference any specific, particularly prominent ones (e.g. deCODE, UK Biobank, etc.):
merged_manual = merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME')[['Count']]
holder = 'The 30 most commonly used datasets are:\n\n'
for index, row in merged_manual[0:30].iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Count']) + ' times)\n'
print(holder[:-1])
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
PUBMED_by_N = Cat_Ancestry.groupby(['PUBMEDID'])['N'].agg('sum').sort_values(ascending=False)
PUBMED_by_N = PUBMED_by_N[0:1000]
print("We're currently missing " +
str(len(set(PUBMED_by_N.index)-set(finaldataset['PUBMEDID']))) +
" of the biggest 1000 papers grouped by summed ancestry!")
output_path = os.path.abspath(
os.path.join('__file__',
'../..',
'data',
'Support',
'Cohorts',
'MissingLargestCohorts.csv'))
with open(output_path, 'w') as file_handler:
for item in (set(PUBMED_by_N.index)-set(finaldataset['PUBMEDID'])):
file_handler.write("{}\n".format(item))
print('The PUBMEDIDs we are missing are saved in Data/Support/Cohorts/MissingLargestCohorts.csv')